K8SPG-680: add ReadyForBackup condition to the pg-cluster #1133

Merged: 8 commits merged into main from K8SPG-680 on Apr 18, 2025
Conversation

pooknull (Contributor) commented Apr 15, 2025

K8SPG-680

https://perconadev.atlassian.net/browse/K8SPG-680

DESCRIPTION

Problem:
After a failed PVC resize on cluster1-repo1, scheduled backups cannot be created successfully. Although the pg-backup object is created, it gets stuck in the Starting state.

Cause:
When a PVC resize fails, Crunchy's PostgresCluster resource gets an Unknown status for the PGBackRestReplicaRepoReady condition. This condition is required for creating a backup job in the reconcileManualBackup method:

// determine if the dedicated repository host is ready using the repo host ready
// condition, and return if not
repoCondition := meta.FindStatusCondition(postgresCluster.Status.Conditions, ConditionRepoHostReady)
if repoCondition == nil || repoCondition.Status != metav1.ConditionTrue {
	return nil
}

// Determine if the replica create backup is complete and return if not. This allows for proper
// orchestration of backup Jobs since only one backup can be run at a time.
backupCondition := meta.FindStatusCondition(postgresCluster.Status.Conditions,
	ConditionReplicaCreate)
if backupCondition == nil || backupCondition.Status != metav1.ConditionTrue {
	return nil
}

As a result, the operator waits indefinitely for the backup job to appear:

if errors.Is(err, ErrBackupJobNotFound) {
	log.Info("Waiting for backup to start")
	return reconcile.Result{RequeueAfter: time.Second * 5}, nil
}
return reconcile.Result{}, errors.Wrap(err, "find backup job")

Solution:

  • Add a new .status.conditions field to the PerconaPGCluster resource.
  • If the required conditions on the PostgresCluster resource (PGBackRestRepoHostReady and PGBackRestReplicaCreate) are not True, a ReadyForBackup condition with status False is added to the PerconaPGCluster.
  • If ReadyForBackup is False, the operator skips the scheduled backup creation and logs a message instead.
  • When a new PerconaPGBackup resource is created and the operator is waiting for its backup job to appear, it checks the ReadyForBackup condition. If the condition has been False for more than 2 minutes, the backup is marked as Failed (see the sketch after this list).
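
The PR's exact implementation lives in the operator's controllers; the following is only a minimal sketch of both checks, assuming hypothetical helper names (updateReadyForBackup, shouldFailWaitingBackup) and the apimachinery condition helpers. It shows how ReadyForBackup could be derived from the two PostgresCluster conditions and how the 2-minute rule could be applied while waiting for the backup job.

// Sketch only: helper names, reasons and messages below are illustrative,
// not the operator's actual code.
package pgcluster

import (
	"time"

	"k8s.io/apimachinery/pkg/api/meta"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const ConditionReadyForBackup = "ReadyForBackup"

// updateReadyForBackup derives the ReadyForBackup condition for the
// PerconaPGCluster status from the underlying PostgresCluster conditions.
func updateReadyForBackup(crunchyConditions []metav1.Condition, perconaConditions *[]metav1.Condition) {
	status := metav1.ConditionTrue
	reason := "BackupConditionsMet"

	for _, t := range []string{"PGBackRestRepoHostReady", "PGBackRestReplicaCreate"} {
		c := meta.FindStatusCondition(crunchyConditions, t)
		if c == nil || c.Status != metav1.ConditionTrue {
			status = metav1.ConditionFalse
			reason = "BackupConditionsNotMet"
			break
		}
	}

	// SetStatusCondition updates LastTransitionTime only when the status changes,
	// which is what the 2-minute check below relies on.
	meta.SetStatusCondition(perconaConditions, metav1.Condition{
		Type:    ConditionReadyForBackup,
		Status:  status,
		Reason:  reason,
		Message: "derived from PGBackRestRepoHostReady and PGBackRestReplicaCreate",
	})
}

// shouldFailWaitingBackup sketches the 2-minute rule: while the backup
// controller waits for the backup job to appear, it gives up once
// ReadyForBackup has been False for longer than the grace period.
func shouldFailWaitingBackup(perconaConditions []metav1.Condition, now time.Time) bool {
	c := meta.FindStatusCondition(perconaConditions, ConditionReadyForBackup)
	if c == nil || c.Status != metav1.ConditionFalse {
		return false
	}
	return now.Sub(c.LastTransitionTime.Time) > 2*time.Minute
}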

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported PG version?
  • Does the change support oldest and newest supported Kubernetes version?

@pooknull pooknull marked this pull request as ready for review April 16, 2025 11:18
Comment on lines 27 to 36
func (f *fakeClient) Patch(ctx context.Context, obj client.Object, patch client.Patch, options ...client.PatchOption) error {
	err := f.Client.Patch(ctx, obj, patch, options...)
	if !k8serrors.IsNotFound(err) {
		// Patch succeeded, or failed for a reason other than the object missing.
		return err
	}
	// The object does not exist yet: create it, then retry the patch.
	if err := f.Create(ctx, obj); err != nil {
		return err
	}
	return f.Client.Patch(ctx, obj, patch, options...)
}
Contributor
Do we need this? By removing it nothing fails on the controller tests.


@@ -505,7 +470,7 @@ func updatePGBackrestInfo(ctx context.Context, c client.Client, pod *corev1.Pod,
 }

 func finishBackup(ctx context.Context, c client.Client, pgBackup *v2.PerconaPGBackup, job *batchv1.Job) (*reconcile.Result, error) {
-	if checkBackupJob(job) == v2.BackupSucceeded {
+	if job != nil && checkBackupJob(job) == v2.BackupSucceeded {
@gkech (Contributor) commented Apr 17, 2025
Should we maybe validate the input job once at the top of the function and avoid repeating the same check across different places?

e.g.

func finishBackup(ctx context.Context, c client.Client, pgBackup *v2.PerconaPGBackup, job *batchv1.Job) (*reconcile.Result, error) {
	if job == nil {
		// do something
	}

pooknull (Contributor, PR author) replied:

It will be clearer to repeat this check. We can't change the order of the actions in this function, so adding an if job == nil check at the top would just lead to a lot of duplicated code.
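
For illustration only (the real finishBackup body is not shown here): when job-dependent checks are interleaved with steps that must always run, guarding each check in place keeps a single sequence, whereas an early nil branch would have to duplicate the always-run steps.

// Hypothetical example, not the operator's finishBackup: the prints stand in
// for steps that must always run, whether or not a job exists.
package main

import "fmt"

func finishExample(jobSucceeded *bool) {
	fmt.Println("step A: update status fields") // always runs

	if jobSucceeded != nil && *jobSucceeded { // nil guard repeated in place
		fmt.Println("record success")
	}

	fmt.Println("step B: clean up resources") // always runs

	if jobSucceeded != nil && !*jobSucceeded { // nil guard repeated in place
		fmt.Println("record failure")
	}
	// Hoisting an `if jobSucceeded == nil { ... }` branch to the top would force
	// steps A and B to be duplicated inside it (or an early return would skip them).
}

func main() {
	finishExample(nil) // steps A and B still run without a job
}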

@hors hors merged commit 56cbc36 into main Apr 18, 2025
15 of 16 checks passed
@hors hors deleted the K8SPG-680 branch April 18, 2025 09:54